@node POSIX regular expressions
@section Regular expressions

@cindex string matching
The procedures in this section provide access to POSIX regular
expression matching.  The regular expression syntax and semantics are
far too complex to be described here.

@strong{Note:} Because the C interface uses ASCII @code{NUL} bytes to
mark the ends of strings, patterns & strings that contain @code{NUL}
characters will not work correctly.

@subsection Direct POSIX regular expression interface

@stindex posix-regexps
The first interface to regular expressions is a thin layer over the
interface that POSIX provides.  It is exported by the structures
@code{posix-regexps} & @code{posix}.

@deffn procedure make-regexp string option @dots{} @returns{} regexp
@deffnx procedure regexp? object @returns{} boolean
@code{Make-regexp} creates a regular expression with the given string
pattern.  The arguments after @var{string} specify various options for
the regular expression; see @code{regexp-option} below.  The regular
expression is not compiled until it is matched against a string, so any
errors in the pattern string will not be reported until that point.
@code{Regexp?} is the disjoint type predicate for regular expression
objects.
@end deffn

@deffn syntax regexp-option name @returns{} regexp-option
Evaluates to a regular expression option, suitable to be passed to
@code{make-regexp}, with the given name.  The possible option names
are:

@table @code
@item extended
use the extended patterns
@item ignore-case
ignore case differences when matching
@item submatches
report submatches
@item newline
treat newlines specially
@end table
@end deffn
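
As a minimal sketch of how @code{make-regexp} and @code{regexp-option}
fit together, the following builds two regular expression objects from
POSIX pattern strings.  The pattern strings and the names @code{word} &
@code{tagged} are invented for illustration, and the code assumes that
the @code{posix-regexps} structure has been opened.

@lisp
;; A case-insensitive pattern in the extended POSIX syntax.
(define word
  (make-regexp "[a-z]+"
               (regexp-option extended)
               (regexp-option ignore-case)))

;; A pattern whose parenthesized group is to be reported as a
;; submatch when the pattern is matched.
(define tagged
  (make-regexp "<([a-z]+)>"
               (regexp-option extended)
               (regexp-option submatches)))

(regexp? word)                        @result{} #t
(regexp? "[a-z]+")                    @result{} #f
@end lisp

Because compilation is deferred, neither definition would report an
error even if its pattern string were malformed; the error would appear
only at the first match.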

@deffn procedure regexp-match regexp string start submatches? starts-line? ends-line? @returns{} boolean or list of matches
@code{Regexp-match} matches @var{regexp} against the characters in
@var{string}, starting at position @var{start}.  If the string does not
match the regular expression, @code{regexp-match} returns @code{#f}.
If the string does match, then a list of match records is returned if
@var{submatches?} is true or @code{#t} if @var{submatches?} is false.
The first match record gives the location of the substring that matched
@var{regexp}.  If the pattern in @var{regexp} contained submatches,
then the submatches are returned in order, with match records in the
positions where submatches succeeded and @code{#f} in the positions
where submatches failed.  

@var{Starts-line?} should be true if @var{string} starts at the
beginning of a line, and @var{ends-line?} should be true if it ends
one.
@end deffn

@deffn procedure match? object @returns{} boolean
@deffnx procedure match-start match @returns{} integer
@deffnx procedure match-end match @returns{} integer
@deffnx procedure match-submatches match @returns{} alist
@code{Match?} is the disjoint type predicate for match records.  Match
records contain three values: the beginning & end of the substring that
matched the pattern and an association list of submatch keys and
corresponding match records for any named submatches that also matched.
@code{Match-start} returns the index of the first character in the
matching substring, and @code{match-end} gives the index of the
first character after the matching substring.  @code{Match-submatches}
returns the alist of submatches.
@end deffn
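
The following sketch ties @code{regexp-match} to the match record
accessors.  The pattern, the input string, and the names @code{r} &
@code{matches} are illustrative assumptions; the indices shown follow
from the descriptions of @code{match-start} & @code{match-end} above,
with string positions counted from zero.

@lisp
(define r
  (make-regexp "a(b*)c"
               (regexp-option extended)
               (regexp-option submatches)))

;; Match from index 0, asking for match records, and treating the
;; string as one complete line.
(define matches (regexp-match r "xxabbbcxx" 0 #t #t #t))

;; The first record describes the whole matching substring, "abbbc".
(match? (car matches))                @result{} #t
(match-start (car matches))           @result{} 2
(match-end (car matches))             @result{} 7

;; With submatches? false, only success or failure is reported.
(regexp-match r "xxabbbcxx" 0 #f #t #t)   @result{} #t
(regexp-match r "xyz" 0 #f #t #t)         @result{} #f
@end lisp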

@subsection High-level regular expression construction

@stindex regexp
This section describes a functional interface for building regular
expressions and matching them against strings, higher-level than the
direct POSIX interface.  The matching is done using the POSIX regular
expression package.  Regular expressions constructed by procedures
listed here are compatible with those in the previous section; that is,
they satisfy the predicate @code{regexp?} from the @code{posix-regexps}
structure.  These names are exported by the structure @code{regexps}.

@subsubsection Character sets

Character sets may be defined using a list of characters and strings,
using a range or ranges of characters, or by using set operations on
existing character sets.

@deffn procedure set char-or-string @dots{} @returns{} char-set-regexp
@deffnx procedure range low-char high-char @returns{} char-set-regexp
@deffnx procedure ranges low-char high-char @dots{} @returns{} char-set-regexp
@deffnx procedure ascii-range low-char high-char @returns{} char-set-regexp
@deffnx procedure ascii-ranges low-char high-char @dots{} @returns{} char-set-regexp
@code{Set} returns a character set that contains all of the character
arguments and all of the characters in all of the string arguments.
@code{Range} returns a character set that contains all characters
between @var{low-char} and @var{high-char}, inclusive.  @code{Ranges}
returns a set that contains all of the characters in the given set of
ranges.  @code{Range} & @code{ranges} use the ordering imposed by
@code{char->integer}.  @code{Ascii-range} & @code{ascii-ranges} are
like @code{range} & @code{ranges}, but they use the ASCII ordering.
@code{Ranges} & @code{ascii-ranges} must be given an even number of
arguments.  It is an error for a @var{high-char} to be less than the
preceding @var{low-char} in the appropriate ordering.
@end deffn
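
A short sketch of these constructors follows.  The names @code{vowel},
@code{digit}, & @code{hex-letter} are invented for illustration, and
@code{exact-match?}, described later in this section, is used only to
show which single-character strings each set accepts.

@lisp
(define vowel (set "aeiou"))
(define digit (range #\0 #\9))
;; Two ranges, given as two low/high pairs: a-f and A-F.
(define hex-letter (ranges #\a #\f #\A #\F))

(exact-match? vowel "e")              @result{} #t
(exact-match? vowel "z")              @result{} #f
(exact-match? digit "7")              @result{} #t
(exact-match? digit "x")              @result{} #f
(exact-match? hex-letter "C")         @result{} #t
(exact-match? hex-letter "g")         @result{} #f
@end lisp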

@deffn procedure negate char-set @returns{} char-set-regexp
@deffnx procedure union char-set@suba{a} char-set@suba{b} @returns{} char-set-regexp
@deffnx procedure intersection char-set@suba{a} char-set@suba{b} @returns{} char-set-regexp
@deffnx procedure subtract char-set@suba{a} char-set@suba{b} @returns{} char-set-regexp
Set operations on character sets.  @code{Negate} returns a character
set of all characters that are not in @var{char-set}.  @code{Union}
returns a character set that contains all of the characters in
@var{char-set@suba{a}} and all of the characters in
@var{char-set@suba{b}}.  @code{Intersection} returns a character set of
all of the characters that are in both @var{char-set@suba{a}} and
@var{char-set@suba{b}}.  @code{Subtract} returns a character set of all
the characters in @var{char-set@suba{a}} that are not also in
@var{char-set@suba{b}}.
@end deffn
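
The set operations can be sketched as follows.  The names defined here
are invented for illustration; @code{lower-case} & @code{numeric} are
two of the predefined character sets listed below, and
@code{exact-match?} is described later in this section.

@lisp
(define vowel (set "aeiou"))
;; Lower-case letters that are not vowels.
(define consonant (subtract lower-case vowel))
;; Any character other than a decimal digit.
(define non-digit (negate numeric))
;; Decimal digits together with the letters a-f in either case.
(define hex-char (union numeric (ranges #\a #\f #\A #\F)))
;; Lower-case letters that are also in the range a-f.
(define lower-hex-letter (intersection lower-case (range #\a #\f)))

(exact-match? consonant "k")          @result{} #t
(exact-match? consonant "e")          @result{} #f
(exact-match? non-digit "5")          @result{} #f
(exact-match? hex-char "B")           @result{} #t
(exact-match? lower-hex-letter "c")   @result{} #t
(exact-match? lower-hex-letter "q")   @result{} #f
@end lisp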

@defvr {character set} lower-case = @code{(set "abcdefghijklmnopqrstuvwxyz")}
@defvrx {character set} upper-case = @code{(set "ABCDEFGHIJKLMNOPQRSTUVWXYZ")}
@defvrx {character set} alphabetic = @code{(union lower-case upper-case)}
@defvrx {character set} numeric = @code{(set "0123456789")}
@defvrx {character set} alphanumeric = @code{(union alphabetic numeric)}
@defvrx {character set} punctuation = @code{(set "!\"#$%&'()*+,-./:;<=>?@@[\\]^_`@{|@}~")}
@defvrx {character set} graphic = @code{(union alphanumeric punctuation)}
@defvrx {character set} printing = @code{(union graphic (set #\space))}
@defvrx {character set} control = @code{(negate printing)}
@defvrx {character set} blank = @code{(set #\space (ascii->char 9))     ; ASCII 9 = TAB}
@defvrx {character set} whitespace = @code{(union (set #\space) (ascii-range 9 13))}
@defvrx {character set} hexdigit = @code{(set "0123456789ABCDEF")}
Predefined character sets.
@end defvr

@subsubsection Anchoring

@deffn procedure string-start @returns{} regexp
@deffnx procedure string-end @returns{} regexp
@code{String-start} returns a regular expression that matches the
beginning of the string being matched against; @code{string-end}
returns one that matches the end.
@end deffn
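
As an illustration, the sketch below anchors a literal pattern at either
end of the string.  It relies on @code{sequence}, @code{text}, &
@code{any-match?}, all described later in this section; the names
@code{starts-with-abc} & @code{ends-with-abc} are invented for the
example.

@lisp
(define starts-with-abc (sequence (string-start) (text "abc")))
(define ends-with-abc (sequence (text "abc") (string-end)))

(any-match? starts-with-abc "abcxx")  @result{} #t
(any-match? starts-with-abc "xxabc")  @result{} #f
(any-match? ends-with-abc "xxabc")    @result{} #t
(any-match? ends-with-abc "abcxx")    @result{} #f
@end lisp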

@subsubsection Composite expressions

@deffn procedure sequence regexp @dots{} @returns{} regexp
@deffnx procedure one-of regexp @dots{} @returns{} regexp
@code{Sequence} returns a regular expression that matches the
concatenation of all of its arguments; @code{one-of} returns a regular
expression that matches any one of its arguments.
@end deffn

@deffn procedure text string @returns{} regexp
Returns a regular expression that matches exactly the characters in
@var{string}, in order.
@end deffn

@deffn procedure repeat regexp @returns{} regexp
@deffnx procedure repeat count regexp @returns{} regexp
@deffnx procedure repeat min max regexp @returns{} regexp
@code{Repeat} returns a regular expression that matches repeated
occurrences of its @var{regexp} argument.  With only one argument, the
result will match @var{regexp} any number of times, including zero.  With two
arguments, @ie{} one @var{count} argument, the returned regular
expression will match @var{regexp} exactly that number of times.  The
final case will match from @var{min} to @var{max} repetitions,
inclusive.  @var{Max} may be @code{#f}, in which case there is no
maximum number of matches.  @var{Count} & @var{min} must be exact,
non-negative integers; @var{max} should be either @code{#f} or an
exact, non-negative integer.
@end deffn
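
The composite constructors might be combined as in the following
sketch.  The name @code{identifier} is invented for illustration;
@code{alphabetic}, @code{alphanumeric}, & @code{numeric} are the
predefined character sets listed earlier, and @code{exact-match?} is
described later in this section.

@lisp
;; A letter followed by any number of letters and digits.
(define identifier
  (sequence alphabetic (repeat alphanumeric)))

(exact-match? identifier "x25")       @result{} #t
(exact-match? identifier "25x")       @result{} #f

;; Exactly three repetitions of "ab".
(exact-match? (repeat 3 (text "ab")) "ababab")    @result{} #t
(exact-match? (repeat 3 (text "ab")) "abab")      @result{} #f

;; From two to four digits; with #f as the maximum, two or more.
(exact-match? (repeat 2 4 numeric) "12345")       @result{} #f
(exact-match? (repeat 2 #f numeric) "12345")      @result{} #t

;; Either alternative is acceptable.
(exact-match? (one-of (text "cat") (text "dog")) "dog")   @result{} #t
@end lisp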

@subsubsection Case sensitivity

Regular expressions are normally case-sensitive, but case sensitivity
can be manipulated simply.

@deffn procedure ignore-case regexp @returns{} regexp
@deffnx procedure use-case regexp @returns{} regexp
The regular expression returned by @code{ignore-case} is identical to
its argument except that the case will be ignored when matching.  The
value returned by @code{use-case} is protected from future applications
of @code{ignore-case}.  The expressions returned by @code{use-case} and
@code{ignore-case} are unaffected by any enclosing uses of these
procedures.

By way of example, the following matches @code{"ab"}, but not
@code{"aB"}, @code{"Ab"}, or @code{"AB"}:

@lisp
(text "ab")@end lisp

@noindent
while

@lisp
(ignore-case (text "ab"))@end lisp

@noindent
matches all of those, and

@lisp
(ignore-case (sequence (text "a")
                       (use-case (text "b"))))@end lisp

@noindent
matches @code{"ab"} or @code{"Ab"}, but not @code{"aB"} or @code{"AB"}.
@end deffn

@subsubsection Submatches and matching

A subexpression within a larger expression can be marked as a submatch.
When an expression is matched against a string, the success or failure
of each submatch within that expression is reported, as well as the
location of the substring matched by each successful submatch.

@deffn procedure submatch key regexp @returns{} regexp
@deffnx procedure no-submatches regexp @returns{} regexp
@code{Submatch} returns a regular expression that is equivalent to
@var{regexp} in every way except that the regular expression returned by
@code{submatch} will produce a submatch record in the output for the
part of the string matched by @var{regexp}.  @code{No-submatches}
returns a regular expression that is equivalent to @var{regexp} in every
respect except that all submatches generated by @var{regexp} will be
ignored & removed from the output.
@end deffn

@deffn procedure any-match? regexp string @returns{} boolean
@deffnx procedure exact-match? regexp string @returns{} boolean
@deffnx procedure match regexp string @returns{} match or @code{#f}
@code{Any-match?} returns @code{#t} if @var{string} matches
@var{regexp} or contains a substring that does, or @code{#f}
otherwise.  @code{Exact-match?} returns @code{#t} if @var{string}
matches @var{regexp} exactly, or @code{#f} if it does not.

@code{Match} returns @code{#f} if @var{string} does not match
@var{regexp}, or a match record if it does, as described in the
previous section.  Matching occurs according to POSIX.  The match
returned is the one with the lowest starting index in @var{string}.  If
there is more than one such match, the longest is returned.  Within
that match, the longest possible submatches are returned.

All three matching procedures cache a compiled version of @var{regexp}.
Subsequent calls with the same input regular expression will be more
efficient.
@end deffn

Here are some examples of the high-level regular expression interface:

@lisp
(define pattern (text "abc"))

(any-match? pattern "abc")            @result{} #t
(any-match? pattern "abx")            @result{} #f
(any-match? pattern "xxabcxx")        @result{} #t

(exact-match? pattern "abc")          @result{} #t
(exact-match? pattern "abx")          @result{} #f
(exact-match? pattern "xxabcxx")      @result{} #f

(let ((m (match (sequence (text "ab")
                          (submatch 'foo (text "cd"))
                          (text "ef"))
                "xxxabcdefxx")))
  (list m (match-submatches m)))
    @result{} (#@{Match 3 9@} ((foo . #@{Match 5 7@})))

(match-submatches
 (match (sequence (set "a")
                  (one-of (submatch 'foo (text "bc"))
                          (submatch 'bar (text "BC"))))
        "xxxaBCd"))
    @result{} ((bar . #@{Match 4 6@}))@end lisp
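
Since @code{match-end} is the index just past the matching substring,
match records can be handed directly to @code{substring}.  The sketch
below assumes that the match record accessors are available alongside
@code{match}; the helper @code{matched-substring} and the names
@code{date}, @code{input}, & @code{m} are hypothetical and not part of
the @code{regexps} structure.

@lisp
;; Hypothetical helper: extract the text that a match record refers to.
(define (matched-substring m string)
  (substring string (match-start m) (match-end m)))

(define date
  (sequence (submatch 'year (repeat 4 numeric))
            (text "-")
            (submatch 'month (repeat 2 numeric))))

(define input "on 2004-07")
(define m (match date input))

(matched-substring m input)
    @result{} "2004-07"
(matched-substring (cdr (assq 'year (match-submatches m))) input)
    @result{} "2004"
(matched-substring (cdr (assq 'month (match-submatches m))) input)
    @result{} "07"
@end lisp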
